Computer product and method for sparse matrices

ABSTRACT

A computer program product and method for multiplying a sparse matrix by a vector are disclosed. The computer program product includes a computer readable medium for storing instructions, which, when executed by a computer, cause the computer to efficiently multiply a sparse matrix by a vector, and produce a resulting vector. The computer is made to create a first array containing the non-zero elements of the sparse matrix, and a second array containing the end_of_row position of the last non-zero element in each row of the sparse matrix. A variable is initialized, and then, for each row of the second array, the computer is made to do one of two things. Either, it equates the variable to the sum of the variable and the product of a particular element of the first array and a particular element of the vector. Or, it equates a particular element of the resulting vector to the variable, and then equates the variable to a particular value.

BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] This invention relates generally to computers, and more particularly, to computer program products and methods for causing a computer to function in a particular efficient fashion.

[0003] 2. Description of the Related Art

[0004] Modern computers contain microprocessors, which are essentially the brains of the computer. In operation, the computer uses the microprocessor to run a computer program.

[0005] The computer program might be written in a high-level computer language, such as C or C++, using statements similar to English, which statements are then translated (by another program called a compiler) into numerous machine-language instructions. Or the program might be written in assembly language, and then translated (by another program called an assembler) into machine-language instructions. In practice, every computer language above assembly language is a high-level language.

[0006] Each computer program contains numerous instructions, which tell the computer what precisely it must do, to achieve the desired goal of the program. The computer runs a particular computer program by executing the instructions contained in that program.

[0007] Frequently the goal of the program is to solve complicated real world problems which can be described in mathematical terms. Modern microprocessors permit such programs to be rapidly executed using techniques such as pipelining and speculative execution.

[0008] Modern microprocessors use a design technique called a pipeline, in which the output of one process serves as input to a second, the output of the second process serves as input to a third, and so on, often with more than one process occurring during a particular computer clock cycle.

[0009] Pipelining is a method used in some microprocessors of fetching and decoding instructions in which, at any given time, several program instructions are in various stages of being fetched or decoded. Ideally, pipelining speeds execution time by ensuring that the microprocessor does not have to wait for instructions; when it completes execution of one instruction, the next is ready and waiting. In order to have the next instruction that is to be executed ready and waiting in the pipeline, the microprocessor somehow must predict what that instruction will be.

[0010] Branch prediction is a technique used in some microprocessors to guess whether or not a particular path in a program (called a branch) will be taken during program execution, and to fetch instructions from the appropriate location. When a branch instruction is executed, it and the next instruction executed are stored in a buffer. This information is used to predict which way the instruction will branch the next time it is executed. When the prediction is correct, executing a branch does not cause a pipeline break, so the system is not slowed down by the need to retrieve the next instruction. When the prediction is incorrect, a pipeline break does occur, and the system is slowed down because it then needs to locate and retrieve the next instruction. Such incorrect predictions are sometimes called branch mispredictions.

[0011] Speculative execution is a technique used in some microprocessors in which certain instructions are executed and results made available before the results are actually needed by the program, so that the results are ready and waiting when the program needs them. Which instructions are to be executed speculatively is based on the guesses made about which branches in the program will be taken. In general, when a branch is mispredicted and instructions speculatively executed based on that incorrect branch prediction, the results of the speculatively executed instructions must be discarded, and consequently the computer time and resources used to obtain the now discarded results are wasted.

[0012] Real-world problems frequently can be expressed mathematically using a group of equations generally referred to as a system of simultaneous equations. Those equations, in turn, can be expressed in what is sometimes called matrix form, described more fully below. A computer can then be used to manipulate and perform calculations with the matrices, and solve the problem.

[0013] A matrix is a set of numbers arranged in rows and columns so as to form a rectangular array. The numbers are called the elements of the matrix. If there are m rows and n columns, the matrix is said to be an “m by n” matrix, written “m×n”. For example, $\begin{bmatrix}1 & 3 & 8 \\ 2 & -4 & 5\end{bmatrix}$

[0014] is a 2×3 matrix; it has two rows, and three columns. A matrix with m rows and m columns is called a square matrix of order m. An ordinary number can be regarded as a 1×1 matrix; thus, the number 3 can be thought of as the matrix [3].

[0015] In a common notation, a capital letter denotes a matrix, and the corresponding small letter with a double subscript denotes an element of that matrix. Thus, a_(ij) is the element in the ith row and the jth column of the matrix A. If A is the 2×3 matrix shown above, then a₁₁ equals 1, a₁₂ equals 3, a₁₃ equals 8, a₂₁ equals 2, a₂₂ equals −4, and a₂₃ equals 5. Under certain conditions described more fully below, matrices can be added and multiplied as individual entities.

[0016] Matrices occur naturally in systems of simultaneous equations. In the following system for the unknowns x and y,

2x+3y=7

3x+4y=10,

[0017] the array of numbers $\begin{bmatrix}2 & 3 & 7 \\ 3 & 4 & 10\end{bmatrix}$

[0018] is a matrix whose elements are the coefficients of the unknowns. The solution of the equations depends entirely on these numbers and on their particular arrangement. If 7 and 10 were interchanged, the solution would not be the same.

[0019] A matrix A can be multiplied by an ordinary number c, which is called a scalar. The product is denoted by cA or Ac, and is the matrix whose elements are ca_(ij).

[0020] The multiplication of a matrix A by a matrix B to yield a matrix C is defined only when the number of columns of the matrix A equals the number of rows of the matrix B. To determine the element c_(ij), which is in the ith row and the jth column of the product, the first element in the ith row of A is multiplied by the first element in the jth column of B, the second element in the row by the second element in the column, and so on until the last element in the row is multiplied by the last element of the column; the sum of all these products gives the element c_(ij). In symbols, for the situation where A has n columns and B has n rows,

$c_{ij} = a_{i1} b_{1j} + a_{i2} b_{2j} + \ldots + a_{in} b_{nj}$.

[0021] The matrix C has as many rows as A, and as many columns as B. Thus if A has m rows and n columns, and B has n rows and p columns, then C has m rows and p columns.

[0022] When B has only one column, that is, p=1, B is sometimes referred to as a column vector, or simply a vector. In a common notation, a single subscript is used to denote elements of a vector. Thus, v_(i) is the ith element of the vector V.

[0023] The multiplication of a matrix A by a vector V to yield a vector D is defined only when the number of columns of the matrix A equals the number of elements of the vector V. Thus, multiplying an m×n matrix A by an n-element vector V yields an m-element vector D, the elements of which are indicated below, where the symbol “*” denotes multiplication. $D = A*V = \begin{bmatrix}a_{11} & a_{12} & a_{13} & \ldots & a_{1n} \\ a_{21} & a_{22} & a_{23} & \ldots & a_{2n} \\ a_{31} & a_{32} & a_{33} & \ldots & a_{3n} \\ \vdots & & & & \vdots \\ a_{m1} & a_{m2} & a_{m3} & \ldots & a_{mn}\end{bmatrix} \begin{bmatrix}v_{1} \\ v_{2} \\ v_{3} \\ \vdots \\ v_{n}\end{bmatrix} = \begin{bmatrix}a_{11}v_{1} + a_{12}v_{2} + a_{13}v_{3} + \ldots + a_{1n}v_{n} \\ a_{21}v_{1} + a_{22}v_{2} + \ldots + a_{2n}v_{n} \\ \vdots \\ a_{m1}v_{1} + a_{m2}v_{2} + \ldots + a_{mn}v_{n}\end{bmatrix}$

[0024] The individual elements of a matrix may be zero or non-zero. A matrix in which the non-zero elements amount to a very small percentage of the total number of elements is sometimes referred to as a sparse matrix. Sparse matrices occur frequently in practice. Problems such as structural analysis, network flow analysis, difference approximations to differential equations, finite element analysis, financial modeling, fluid dynamics, and so forth, all lead to sparse matrices. Because sparse matrices, and particularly large sparse matrices, frequently occur, techniques have been developed to take advantage of the large number of zeros contained in the sparse matrix, to avoid unnecessary computation and unnecessary storage.

[0025] When computers are used for sparse matrix computations, the sparse matrix usually is stored in a compressed form to reduce the storage requirements. In one such known compressed form, only the non-zero elements of the matrix are stored, along with the row and column location for each non-zero element.

[0026] In one known prior art method, the non-zero elements of each row of the sparse matrix are stored linearly in a first array, and a second array is used to keep track of the locations in the first array corresponding to the end of each row of the sparse matrix. A third array is used to keep track of the column location in the sparse matrix for each element in the first array. A known prior art method for computing the product of such a sparse matrix with a vector is illustrated in FIG. 1, and sample code is set forth below; in each the first array is called “matrix”, the second array is called “end_of_row”, the third array is called “column”, and the resulting vector is called “result”.

    ! end_of_row(0) is taken to be 0, so that the first row starts at element 1
    do row = 1, number_of_rows
       result(row) = 0.0
       do i = (end_of_row(row-1)+1), end_of_row(row)
          result(row) = result(row) + matrix(i) * vector(column(i))
       end do
    end do
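The nested-loop method above can be sketched in Python, which is used here only for illustration. This is a minimal sketch, not the code of FIG. 1: the function name spmv_nested and the small 2×2 test data are hypothetical, and the arrays are assumed to hold 1-based positions as in the text.

```python
def spmv_nested(matrix, end_of_row, column, vector):
    """Nested-loop sparse matrix-vector product (prior-art formulation).

    end_of_row and column hold 1-based positions, as in the text;
    the Python lists themselves are indexed from 0.
    """
    result = []
    start = 0  # 0-based index of the first stored element of the current row
    for end in end_of_row:            # outer loop: one pass per row
        total = 0.0
        for i in range(start, end):   # inner loop: trip count depends on the data
            total += matrix[i] * vector[column[i] - 1]
        result.append(total)
        start = end                   # the next row begins where this one ended
    return result


# Hypothetical 2x2 example: rows [1, 5] and [4, 0], multiplied by [10, 1].
small_result = spmv_nested([1, 5, 4], [2, 3], [1, 2, 1], [10, 1])
print(small_result)
```

The variable start plays the role of end_of_row(row−1)+1 in the sample code, so no separate zeroth entry of end_of_row is needed in this sketch.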

[0027] When using this prior art technique to compute the product of a sparse matrix with a vector, it is necessary to determine the column index of each element in the first array, and compute its product with the corresponding element in the vector. This product is then accumulated until the end of the row is reached. Once the end of the row is reached, the accumulator is cleared, and the process is repeated for the next row. This is done until all the rows are processed.

[0028] The prior art method illustrated in FIG. 1 and in the sample code above includes two DO loops: an outer DO loop; and an inner DO loop. The inner DO loop, denoted by reference numeral 210 in FIG. 2, includes, in general, steps 130, 140, 145 and 150 of FIG. 1; the outer DO loop, denoted by reference numeral 220 in FIG. 2, includes, in general, steps 110, 120, 155 and 160 of FIG. 1.

[0029] The inner DO loop is data dependent. That is, the number of times the inner loop calculations are performed is determined by the number of non-zero elements in each row of the sparse matrix. A particular row might have a small number of elements, or a large number of elements; the number of elements is not known until the calculations are made. This results in branch mispredictions caused by the microprocessor predicting the next computation will be in the inner loop when, in reality, because of the data, another branch of the program (the branch for the outer DO loop) must be executed next.

[0030] In the illustrated prior art method, such branch mispredictions can occur at the end of each row of the sparse matrix, that is, at the end of each inner DO loop. Such branch mispredictions in modern microprocessors result in lost performance.
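The effect described above can be illustrated with a toy simulation in Python. This is only a sketch under stated assumptions: the predictor simply repeats the last observed direction of the loop-closing branch (a simplification of the buffer scheme described earlier, not a model of any particular microprocessor), the initial prediction is "taken", and the row lengths are hypothetical. All function names are illustrative.

```python
def count_mispredictions(outcomes, initial_prediction=True):
    """Count misses of a predictor that always repeats the last outcome."""
    prediction = initial_prediction
    misses = 0
    for taken in outcomes:
        if taken != prediction:
            misses += 1
        prediction = taken  # remember the most recent direction
    return misses


def loop_closing_outcomes(trip_counts):
    """Loop-closing branch: taken while iterations remain, not taken at the end."""
    outcomes = []
    for n in trip_counts:
        outcomes.extend([True] * (n - 1) + [False])
    return outcomes


# Hypothetical non-zero counts per row of a sparse matrix.
row_lengths = [2, 3, 1, 2, 4, 4, 2, 4, 1]

# Nested loops: the inner loop-closing branch falls through once per row.
nested_misses = count_mispredictions(loop_closing_outcomes(row_lengths))

# One loop over all elements: the loop-closing branch falls through only once.
single_misses = count_mispredictions(loop_closing_outcomes([sum(row_lengths)]))

print(nested_misses, single_misses)
```

With these assumed row lengths the toy predictor misses 13 of the 23 inner-loop branches, against a single miss for one loop over all 23 elements. The comparison covers only the loop-closing branch; any conditional inside the loop body is a separate matter.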

[0031] The present invention is directed to overcoming, or at least reducing, the effects of one or more of the problems mentioned above.

SUMMARY OF THE INVENTION

[0032] In one aspect of the present invention, provided is a computer readable medium for storing instructions, which, when executed by a computer, cause the computer to efficiently multiply a sparse matrix by a vector by performing certain steps. The steps include creating a first array containing the non-zero elements of the sparse matrix, creating a second array containing the row position of the last non-zero element in each row of the sparse matrix, and initializing a variable. Then, executing a set of instructions for each element of the second array, the steps include either equating the variable to the sum of the variable and the product of a particular element of the first array and a particular element of the vector, or equating a particular element of the resulting vector to the variable and then equating the variable to a particular value.

[0033] In one embodiment of the invention, the set of instructions is predicated. In yet another embodiment of the invention, the invention further comprises the step of prefetching the elements of the matrix array and elements of the column array from memory.

[0034] Another aspect of the invention shows an allocation control mechanism separating the elements with temporal locality from the elements with spatial locality. The elements with temporal locality are then stored in a cache memory.

[0035] According to another aspect of the invention, the steps also include creating a third array containing the column position of each of the non-zero elements of the matrix, and using that third array to select the particular element of the vector that is to be multiplied.

[0036] According to yet another aspect of the present invention, the vector is stored in cache memory, and the first, second, and third arrays are stored in a different memory, such that the vector is accessed via a particular access path and the arrays are accessed via a different access path.

BRIEF DESCRIPTION OF THE DRAWINGS

[0037] Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the drawings in which:

[0038] FIGS. 1 and 2 illustrate a known prior art method;

[0039] FIGS. 3, 3a, and 4 illustrate a method according to one embodiment of the present invention;

[0040] FIGS. 5 and 5a illustrate a method according to another embodiment of the present invention;

[0041] FIG. 6 illustrates a method according to yet another embodiment of the present invention; and

[0042] FIG. 7 illustrates a method according to still another embodiment of the present invention.

[0043] FIG. 8 illustrates the structure of array allocation of the present invention.

[0044] While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the description herein of specific embodiments is not intended to limit the invention to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.

DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

[0045] Illustrative embodiments of the invention are described below. In the interest of clarity, not all features of an actual implementation are described in this specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.

[0046] Performance of sparse matrix computations on modern microprocessors, using known methods, suffers because of the inherent unpredictability of the inner loop closing branch, which results in branch mispredictions. The iteration count of the inner loop is data dependent, and consequently follows no deterministic pattern for general sparse matrices.

[0047] The present invention eliminates such unpredictability, thereby eliminating the performance loss due to such branch mispredictions. The present invention collapses the nested loops of the known technique illustrated in FIG. 1 into a single larger loop wherein the instructions in the loop are predicated. This increases the scope of prefetching and enables better latency tolerance. The present invention also makes use of otherwise wasted computations resulting from branch mispredictions. The present invention, by managing cache allocation, also permits data to be organized to maximize bandwidth utilization.

[0048] FIG. 1 shows an inefficient nested loop formulation. The inner loop count is small and the loop overhead is large. The “end do” for the “do i” loop is flaky, meaning that the branch could go either way, thus making it hard for the branch predictor to predict the right direction. Hence, this flakiness causes branch misprediction.

[0049] The present invention solves this problem by collapsing the two loops shown in prior art into one as shown in FIG. 3 and using predication. Predication eliminates the remaining flakiness that still exists after collapsing the loops. The idea here is that every instruction in the instruction set is augmented with a field that says “execute this instruction if the predicate is true.” The predicate is a flag, a logical, a Boolean value that says true or false. In the present invention, the Boolean value is the condition of an “if, then, else” statement. When the condition is evaluated, the result is either true or false. Typically, one set of instructions is executed when the condition is true and another set of instructions is executed if the condition is false. Predication, on the other hand, executes both sets of instructions in one sequence. By predicating the instructions that are to be executed in the single collapsed loop, the instructions are executed in one flow. The branches are removed from the equation. Thus, the control flow depicted in the prior art (FIG. 1) is converted into data flow as shown in FIG. 3. Since the present invention does not involve branches, the present invention does away with branch mispredictions. Consequently, by collapsing the two loops shown in prior art (FIG. 1) into one as shown in FIG. 3 and using predication, the present invention eliminates the problem of branch mispredictions.
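The conversion of control flow into data flow can be imitated in software. The Python sketch below is an emulation under stated assumptions, not the predicated instruction sequence itself: Python has no predicated instructions, so the predicate is represented as the integer 0 or 1 and used in arithmetic. The function name spmv_select and the small test data are hypothetical, and finite values are assumed so that multiplying by zero clears the accumulator.

```python
def spmv_select(matrix, end_of_row, column, vector):
    """Single-loop product with the if/else replaced by arithmetic selects.

    end_of_row and column hold 1-based positions; lists are indexed from 0.
    """
    result = [0.0] * len(end_of_row)
    row = 0
    accumulator = 0.0
    for ii in range(end_of_row[-1]):
        tmp_product = matrix[ii] * vector[column[ii] - 1]
        # Predicate: 1 when element ii is the first element of the next row.
        p = int(ii >= end_of_row[row])
        # Unconditional store; a partial sum written here is overwritten later.
        result[row] = accumulator
        # Both "sides" of the original if/else are folded into one sequence.
        row += p
        accumulator = accumulator * (1 - p) + tmp_product
    result[row] = accumulator
    return result


# Hypothetical 2x2 example: rows [1, 5] and [4, 0], multiplied by [10, 1].
select_result = spmv_select([1, 5, 4], [2, 3], [1, 2, 1], [10, 1])
print(select_result)
```

The same statements run on every iteration regardless of the predicate, which is the property the paragraph above attributes to predication.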

[0050] To help describe the present invention, specific mathematical examples are used. Obviously, these examples are used for illustrative purposes only, and the present invention is not limited to these examples. The matrix A, $\begin{bmatrix}1 & 0 & 0 & 0 & 5 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 4 & 0 & 0 & 2 & 0 & 0 & 6 & 0 \\ 0 & 0 & 0 & 12 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 4 & 0 & 0 & 0 & 0 & 3 & 0 & 0 & 0 \\ 2 & 0 & 0 & 0 & 3 & 0 & 0 & 1 & 3 & 0 \\ 0 & 0 & 0 & 13 & 2 & 1 & 4 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 8 & 0 & 0 \\ 3 & 6 & 8 & 0 & 0 & 0 & 0 & 0 & 0 & 11 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 3 & 0 & 0\end{bmatrix}$

[0051] is a 9×10 matrix, and thus has 90 elements. Of these, 23 are non-zero elements, and 67 are zero elements. Matrix A is therefore a sparse matrix because the non-zero elements amount to only a small percentage of the total number of elements.

[0052] The vector V, $\begin{bmatrix}1 \\ 2 \\ 2 \\ 3 \\ 0 \\ 11 \\ 0 \\ 0 \\ 0 \\ 2\end{bmatrix}$

[0053] is a column vector containing 10 elements. Because the vector V has as many rows as the matrix A has columns, multiplication of the matrix A by the vector V is defined. The product of this multiplication, a vector C, $C = A*V = \begin{bmatrix}c_{1} \\ c_{2} \\ c_{3} \\ \vdots \\ c_{9}\end{bmatrix}$

[0054] has, as its first element c₁,

c₁=(1×1)+(0×2)+(0×2)+(0×3)+(5×0)+(0×11)+(0×0)+(0×0)+(0×0)+(0×2)
c₁=1;

[0055] as its second element, c₂,

c₂=(0×1)+(0×2)+(4×2)+(0×3)+(0×0)+(2×11)+(0×0)+(0×0)+(6×0)+(0×2)
c₂=30;

[0056] and so forth. Because the matrix A has numerous zero elements, the computation for each of the elements c_(i) of the vector C entails numerous multiplications by zero, which would not occur in a sparse representation.
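The full dense computation for this example can be checked with a short Python sketch. The helper name dense_matvec is illustrative only; the data are the matrix A and vector V given above.

```python
# The 9x10 example matrix A and the 10-element vector V from the text.
A = [
    [1, 0, 0, 0, 5, 0, 0, 0, 0, 0],
    [0, 0, 4, 0, 0, 2, 0, 0, 6, 0],
    [0, 0, 0, 12, 0, 0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0, 0, 3, 0, 0, 0],
    [2, 0, 0, 0, 3, 0, 0, 1, 3, 0],
    [0, 0, 0, 13, 2, 1, 4, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 8, 0, 0],
    [3, 6, 8, 0, 0, 0, 0, 0, 0, 11],
    [0, 0, 0, 0, 0, 0, 0, 3, 0, 0],
]
V = [1, 2, 2, 3, 0, 11, 0, 0, 0, 2]


def dense_matvec(a, v):
    """Multiply every element of every row, zeros included."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


C = dense_matvec(A, V)
dense_multiplications = len(A) * len(V)                    # one per matrix element
nonzero_count = sum(1 for row in A for x in row if x != 0)  # products that matter

print(C)
print(dense_multiplications, nonzero_count)
```

The dense form performs 90 multiplications, of which only the 23 involving non-zero matrix elements can contribute to the result.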

[0057] According to one aspect of the present invention, three arrays are formed: a first array, containing the non-zero elements of the matrix A; a second array, containing the end_of_row locations of the first array; and a third array, containing the column locations of each non-zero element of matrix A. Each is described more fully below. Each is preferably, but need not be, a linear array.

[0058] For the example given above, the non-zero elements of the matrix A are: 1, 5, 4, 2, 6, 12, 4, 3, 2, 3, 1, 3, 13, 2, 1, 4, 1, 8, 3, 6, 8, 11, and 3. The first two of these elements, 1 and 5, are contained in the first row of the matrix A, and are located in columns 1 and 5, respectively. In the first row of the matrix A, 1 is the first non-zero element, and 5 is the last non-zero element; the last non-zero element in a particular row is called the end_of_row element for purposes of the present invention.

[0059] Similarly, in the second row of the matrix A, the first non-zero element is 4, the second non-zero element is 2, and the last non-zero element is 6. These three non-zero elements are located in columns 3, 6, and 9, respectively. The last of these three elements, 6, is the end_of_row element of the second row.

[0060] Table 1 sets forth similar information for each of the non-zero elements of matrix A.

TABLE 1
Non-zero elements of matrix A

Non-zero   Element's position   Is element an         Element's column
element    in first array       end_of_row element?   position in matrix A
    1             1             no                     1
    5             2             yes                    5
    4             3             no                     3
    2             4             no                     6
    6             5             yes                    9
   12             6             yes                    4
    4             7             no                     2
    3             8             yes                    7
    2             9             no                     1
    3            10             no                     5
    1            11             no                     8
    3            12             yes                    9
   13            13             no                     4
    2            14             no                     5
    1            15             no                     6
    4            16             yes                    7
    1            17             no                     2
    8            18             yes                    8
    3            19             no                     1
    6            20             no                     2
    8            21             no                     3
   11            22             yes                   10
    3            23             yes                    8

[0061] Accordingly, for the example given above, the first, second, and third arrays are as shown below.

[0062] First Array—the Non Zero Elements of the Matrix A:

[0063] 1 5 4 2 6 12 4 3 2 3 1 3 13 2 1 4 1 8 3 6 8 11 3

[0064] Second Array—End_of_Row Locations of the First Array:

[0065] 2 5 6 8 12 16 18 22 23

[0066] Third Array—Column Locations of Each Element of the Matrix A

[0067] 1 5 3 6 9 4 2 7 1 5 8 9 4 5 6 7 2 8 1 2 3 10 8

[0068] In the description that follows, the first array is called “matrix”, the second array is called “end_of_row”, and the third array is called “column”.
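One way the three arrays might be built from the dense matrix is sketched below in Python. The function name compress is hypothetical, and the sketch assumes the 1-based positions used in the text.

```python
def compress(dense):
    """Build the "matrix", "end_of_row" and "column" arrays from a dense matrix.

    Positions stored in end_of_row and column are 1-based, as in the text.
    """
    matrix, end_of_row, column = [], [], []
    for row in dense:
        for j, value in enumerate(row, start=1):
            if value != 0:
                matrix.append(value)   # non-zero element, in row order
                column.append(j)       # its column in the dense matrix
        end_of_row.append(len(matrix))  # position of the row's last element
    return matrix, end_of_row, column


A = [
    [1, 0, 0, 0, 5, 0, 0, 0, 0, 0],
    [0, 0, 4, 0, 0, 2, 0, 0, 6, 0],
    [0, 0, 0, 12, 0, 0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0, 0, 3, 0, 0, 0],
    [2, 0, 0, 0, 3, 0, 0, 1, 3, 0],
    [0, 0, 0, 13, 2, 1, 4, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 8, 0, 0],
    [3, 6, 8, 0, 0, 0, 0, 0, 0, 11],
    [0, 0, 0, 0, 0, 0, 0, 3, 0, 0],
]

matrix, end_of_row, column = compress(A)
print(matrix)
print(end_of_row)
print(column)
```

For the example matrix A this reproduces the three arrays listed above. A row with no non-zero elements would simply repeat the previous end_of_row value; the text does not discuss that case.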

[0069] Referring now to the drawings, FIG. 3 illustrates the logic flow according to one embodiment of the present invention. In the steps denoted by reference numeral 310, two variables are initialized. The first, called “row”, is a variable used to count rows. The second, called “accumulator”, is used to accumulate particular calculated values.

[0070] The next step, denoted by reference numeral 320, begins a loop in which a variable, ii, is incremented, in increments of 1, starting with the value 1 and ending with the last non-zero element of the matrix A.

[0071] A product, called “tmp_product”, is then calculated in the step denoted by reference numeral 330, by multiplying a particular element of the first array, the “matrix” array, and a particular element of the vector V, called “vector” in FIG. 3.

[0072] A test of the variable ii is then performed in the step denoted by reference numeral 340. This step determines whether the variable ii is greater than the end_of_row location for the particular value of the variable “row”. If it is not greater, then, in the step denoted by reference numeral 350, the variable “accumulator” is assigned the value equal to the sum of “accumulator” and “tmp_product”. If, on the other hand, it is greater, then the steps denoted by reference numerals 360, 370, and 380 are performed. By those steps, the resulting vector C, called “result” in FIG. 3, is assigned a particular value. More specifically, the element of the resulting vector corresponding to the variable “row” is assigned the value of “accumulator”. The variable “row” is incremented by 1, and the variable “accumulator” is assigned a particular value, namely the value of “tmp_product”.

[0073] The decision of step 390 is then made, namely, whether or not the loop has been completed for all rows contained in the matrix A. That is, the loop of the do statement ends when ii=end_of_row(number_of_rows) as shown in the following sample code.

[0074] The final step 400 stores the value in the accumulator into the result(row) vector.

[0075] Sample code for implementing the embodiment illustrated in FIG. 3 is set forth below.

    row = 1
    accumulator = 0.0
    do ii = 1, end_of_row(number_of_rows)
       tmp_product = matrix(ii) * vector(column(ii))
       if (ii > end_of_row(row)) then
          ! element ii starts the next row: store the finished sum
          result(row) = accumulator
          row = row + 1
          accumulator = 0.0 + tmp_product
       else
          accumulator = accumulator + tmp_product
       endif
    end do
    ! store the sum for the last row
    result(row) = accumulator
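A Python rendering of the single-loop sample code is sketched below so that it can be run against the example arrays. It is an illustrative translation, not the code of FIG. 3: the function name spmv_single_loop is hypothetical, list indices are 0-based while the stored positions remain 1-based, and every row is assumed to contain at least one non-zero element.

```python
def spmv_single_loop(matrix, end_of_row, column, vector):
    """One loop over all stored elements, with a row-boundary test inside."""
    result = [0.0] * len(end_of_row)
    row = 0
    accumulator = 0.0
    for ii in range(end_of_row[-1]):
        tmp_product = matrix[ii] * vector[column[ii] - 1]
        if ii >= end_of_row[row]:      # 0-based form of "ii > end_of_row(row)"
            result[row] = accumulator  # the previous row is complete
            row += 1
            accumulator = tmp_product  # start the new row's sum
        else:
            accumulator += tmp_product
    result[row] = accumulator          # store the sum for the last row
    return result


matrix = [1, 5, 4, 2, 6, 12, 4, 3, 2, 3, 1, 3, 13, 2, 1, 4, 1, 8, 3, 6, 8, 11, 3]
end_of_row = [2, 5, 6, 8, 12, 16, 18, 22, 23]
column = [1, 5, 3, 6, 9, 4, 2, 7, 1, 5, 8, 9, 4, 5, 6, 7, 2, 8, 1, 2, 3, 10, 8]
vector = [1, 2, 2, 3, 0, 11, 0, 0, 0, 2]

single_loop_result = spmv_single_loop(matrix, end_of_row, column, vector)
print(single_loop_result)
```

The product is computed before the boundary test, so the same tmp_product feeds whichever side of the test applies.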

[0076] The embodiment illustrated in FIG. 3 produces the same resulting vector values as the prior art method illustrated in FIG. 1, but does so with only a single DO loop, whose bound is known at run time. Referring now to FIG. 4, the initialization block 410 includes the initialization steps denoted by the reference numeral 310 in FIG. 3, and the DO loop box 420 in FIG. 4 includes the steps denoted by the reference numerals 320, 330, 335, 340, 350, 360, 370, 380, and 390 in FIG. 3. FIG. 4, when compared with FIG. 2, illustrates the greatly reduced complexity of the present invention as shown in FIG. 3, from the prior art as shown in FIG. 1.

[0077] In the embodiment of the present invention illustrated in FIG. 3, and in the sample code set forth above, the calculations in the step 330 are performed and the resulting product assigned to the variable “tmp_product”. In another embodiment of the present invention, illustrated in FIG. 5, those calculations are not made before the decision box regarding ii, denoted by reference numeral 340 in FIG. 3 and reference numeral 540 in FIG. 5, but rather are made in the steps of FIG. 5 denoted by reference numerals 550 and 580. Moving this calculation to before the decision regarding ii, as shown in FIG. 3 and in the sample code above, is a particular optimization of the embodiment illustrated in FIG. 5. Such an optimization, which is possible with the methods of the present invention, is not possible with the prior art method illustrated in FIG. 1, because of the nested DO loops of the prior art method.

[0078] FIG. 5a is similar to FIG. 5, but with the reference numerals removed and certain steps referenced by the letters A, B, C, and D, to more clearly correlate certain steps of the embodiment illustrated in FIGS. 5 and 5a with certain aspects of the present invention. Thus, according to one embodiment of the present invention, a variable is initialized, as denoted by the reference letter A in FIG. 5a. Then, for each element of the second array, either the variable is assigned the sum of the variable and the product of a particular element of the first array and a particular element of the vector, denoted by the reference letter B in FIG. 5a; or, a particular element of the resulting vector is assigned the variable, and then the variable is assigned a particular value, denoted by the reference letter C in FIG. 5a. The reference letter D in FIG. 5a denotes the various steps involved in performing this either-or process for each of the elements of the second array.

[0079] FIG. 3a is similar to FIG. 5a, and further describes the optimized embodiment illustrated in FIG. 3. In this optimized embodiment “tmp_product” is used both in the steps denoted by the reference letter B and in the steps denoted by the reference letter C; consequently both reference letters, B and C, are used in FIG. 3a for the step 330 of FIG. 3.

[0080] In the embodiments of FIGS. 3 and 5, the index variable ii is incremented in increments of 1. In some computer systems, it may be advantageous to increment ii by 2. FIG. 6 illustrates such an embodiment, and sample code is set forth below.

    ! when the count of non-zero elements is odd, the last pass reads
    ! element ii + 1 of "matrix" and "column", one past the last stored element
    row = 1
    s = 0.0
    t = 0.0
    do ii = 1, end_of_row(number_of_rows), 2
       if (ii > end_of_row(row)) then
          result(row) = s
          row = row + 1
          t = matrix(ii) * vector(column(ii))
       else
          t = s + matrix(ii) * vector(column(ii))
       end if
       if ((ii + 1) > end_of_row(row)) then
          result(row) = t
          row = row + 1
          s = matrix(ii + 1) * vector(column(ii + 1))
       else
          s = t + matrix(ii + 1) * vector(column(ii + 1))
       end if
    end do

[0081]

    if (mod(end_of_row(number_of_rows), 2) == 0) result(row) = s

This final statement stores the final result in the event that the total count of non-zero elements, end_of_row(number_of_rows), is even, i.e., divisible by 2.
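The increment-by-2 variant can be sketched in Python as follows. This is an illustrative translation under stated assumptions, not the code of FIG. 6: the function name spmv_stride2 is hypothetical, and one padding element is appended to the "matrix" and "column" arrays so that the read of element ii + 1 stays in bounds when the count of non-zero elements is odd. That padding is an assumption of this sketch.

```python
def spmv_stride2(matrix, end_of_row, column, vector):
    """Single loop that handles two stored elements per pass."""
    nnz = end_of_row[-1]
    m = list(matrix) + [0]   # padding element, read only when nnz is odd
    c = list(column) + [1]   # any valid 1-based column works for the padding
    result = [0.0] * len(end_of_row)
    row = 0
    s = 0.0
    for ii in range(0, nnz, 2):
        # First element of the pair: running sum moves from s to t.
        if ii >= end_of_row[row]:
            result[row] = s
            row += 1
            t = m[ii] * vector[c[ii] - 1]
        else:
            t = s + m[ii] * vector[c[ii] - 1]
        # Second element of the pair: running sum moves from t back to s.
        if ii + 1 >= end_of_row[row]:
            result[row] = t
            row += 1
            s = m[ii + 1] * vector[c[ii + 1] - 1]
        else:
            s = t + m[ii + 1] * vector[c[ii + 1] - 1]
    if nnz % 2 == 0:
        result[row] = s      # last row's sum is still pending when nnz is even
    return result


matrix = [1, 5, 4, 2, 6, 12, 4, 3, 2, 3, 1, 3, 13, 2, 1, 4, 1, 8, 3, 6, 8, 11, 3]
end_of_row = [2, 5, 6, 8, 12, 16, 18, 22, 23]
column = [1, 5, 3, 6, 9, 4, 2, 7, 1, 5, 8, 9, 4, 5, 6, 7, 2, 8, 1, 2, 3, 10, 8]
vector = [1, 2, 2, 3, 0, 11, 0, 0, 0, 2]

stride2_result = spmv_stride2(matrix, end_of_row, column, vector)
print(stride2_result)
```

Alternating between s and t lets each half of the pass consume the other half's running sum without a copy.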

[0082] The invention as thus far described may advantageously be used to multiply a sparse matrix with a vector. As will be apparent to those skilled in the art from benefit of the description contained herein, the present invention is not limited to applications involving sparse matrices. Rather, it can be used with any matrix containing zero elements and non-zero elements.

[0083] Additionally, the present invention is not applicable only to multiplying a matrix by a vector. It may advantageously be used to multiply a matrix by another matrix. As described above, a vector is a matrix having a single column; the embodiments of the present invention illustrated in FIGS. 3 and 5 act on the single column of values contained in a column vector. FIG. 7 illustrates the more general case, where the second operand, instead of being a column vector, is an array having one or more columns. The steps denoted by reference numerals 710, 720, 730, 740, 750, 760, 770, 780, and 790 are similar to steps illustrated in FIG. 3. Note, however, that in the embodiment illustrated in FIG. 7, the array “vector” and the array “result” each have an additional index, called “COL”, which is permitted to vary from 1 to the number of columns contained in the array. This is denoted by the steps labeled 705 and 795 in FIG. 7.

[0084] Thus the embodiment illustrated in FIG. 7 can advantageously be used to multiply a matrix having m rows and n columns, containing non-zero elements and zero elements, by an initial array having n rows and p columns, and produce a resulting array having m rows and p columns.
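The extra column index can be sketched in Python as an outer loop around the single-loop method. This is a minimal sketch, not the flow of FIG. 7: the function name sp_matmul, the variable col standing in for “COL”, and the small test data are all hypothetical.

```python
def sp_matmul(matrix, end_of_row, column, dense_b):
    """Compressed (m x n) matrix times dense (n x p) array, column by column."""
    n_rows = len(end_of_row)
    n_cols_b = len(dense_b[0])
    result = [[0.0] * n_cols_b for _ in range(n_rows)]
    for col in range(n_cols_b):            # the additional "COL" index
        row = 0
        accumulator = 0.0
        for ii in range(end_of_row[-1]):   # same single loop as for a vector
            tmp_product = matrix[ii] * dense_b[column[ii] - 1][col]
            if ii >= end_of_row[row]:
                result[row][col] = accumulator
                row += 1
                accumulator = tmp_product
            else:
                accumulator += tmp_product
        result[row][col] = accumulator     # last row of this column
    return result


# Hypothetical example: rows [1, 5] and [4, 0], times B = [[10, 1], [1, 2]].
product = sp_matmul([1, 5, 4], [2, 3], [1, 2, 1], [[10, 1], [1, 2]])
print(product)
```

With p = 1 the outer loop runs once and the sketch reduces to the matrix-by-vector case.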

[0085] Referring once again to the prior art method illustrated in FIG. 1 and in the sample code set forth above, the difference “end_of_row (row)” minus “end_of_row (row −1)” determines the iteration count of the inner loop in the prior art method. This difference is dependent on the number of non-zero elements in that row, which varies from row to row and is thus unpredictable. This unpredictability of the loop branch causes mispredictions in modern microprocessors and results in loss of performance.
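For the example arrays given earlier, these iteration counts can be computed directly. The short Python sketch below assumes, as the prior art sample code does implicitly, that the entry before the first row is 0; the variable names are illustrative.

```python
# "end_of_row" array for the example matrix A (1-based positions).
end_of_row = [2, 5, 6, 8, 12, 16, 18, 22, 23]

# end_of_row(row - 1) for each row, taking the value before row 1 to be 0.
previous_ends = [0] + end_of_row[:-1]

# Inner-loop iteration count per row: end_of_row(row) - end_of_row(row - 1).
trip_counts = [end - prev for end, prev in zip(end_of_row, previous_ends)]
print(trip_counts)
```

The counts vary between 1 and 4 with no regular pattern, which is the data dependence the paragraph above describes.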

[0086] The present invention recognizes and exploits certain aspects ofthe computation. When the inner loop is exited, the value of “i” is“end_of_row (row) +1”. When the inner loop is re-entered the next time,that is, after the outer loop index “row” has been incremented, thevalue of “i” is “end_of_row (row +1 −1) +1”. Both of these values of “i”are the same, that is, the index variable of the inner loop isincremented sequentially. This means that if the inner loop closingbranch was mispredicted after the last iteration of the inner loop, andif as a result of that misprediction the inner loops computation isperformed speculatively, then that computed result need not be discardedbut rather can be used for the next iteration of the outer loop. Thusthe inner loop computation “matrix (i)*vector (column (i))” can beperformed regardless of whether the end of the row has been reached. Theonly aspect of the inner loop computation that changes from oneiteration of the outer loop to the next is the accumulator, whichchanges from “result (row)” to “result (row +1)”.

[0087] Another important aspect recognized and exploited by the present invention is that the outer loop sequences through the rows of the matrix, and the inner loop sequences through the elements of each row. Since the rows are all placed end to end in the matrix array, these two loops together essentially sequence through all of the elements in the matrix array. Thus the loop nest can be flattened into a single loop.
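
A minimal Python sketch of such a flattened loop, following the steps recited in the claims, is given below for illustration. It assumes zero-based indexing, an “end_of_row” array holding the running count of non-zero elements through the end of each row, and at least one non-zero element in every row (a row with no non-zero elements would need additional handling that is not shown). The function name “spmv_flat” is hypothetical.

```python
def spmv_flat(matrix, column, end_of_row, vector):
    """Sparse matrix-vector product using one loop over all non-zero elements.

    The product is computed on every iteration, whether or not the end of
    the current row has been reached; only the accumulator handling differs.
    Assumes every row holds at least one non-zero element.
    """
    nrows = len(end_of_row)
    result = [0.0] * nrows
    row = 0        # first variable: current row
    tmp = 0.0      # second variable: accumulator for the current row
    for i in range(len(matrix)):
        tmp_product = matrix[i] * vector[column[i]]   # third variable
        if i < end_of_row[row]:
            tmp += tmp_product          # element belongs to the current row
        else:
            result[row] = tmp           # current row is complete
            row += 1
            tmp = tmp_product           # product starts the next row's sum
    if nrows:
        result[row] = tmp               # store the last row
    return result


# Sparse form of [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
matrix = [1, 2, 3, 4, 5]
column = [0, 2, 1, 0, 2]
end_of_row = [2, 3, 5]
result = spmv_flat(matrix, column, end_of_row, [1, 2, 3])
```

The only branch left inside the loop selects between two short assignments, which is the kind of construct that can be expressed with predicated instructions instead of a loop-closing branch.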

[0088] The methods of the present invention eliminate the mispredictions that occur in the prior art method at the end of each row. Because the computations are done transparently across the end of each row in the present invention, the scope of prefetching of data elements (“matrix”, “end_of_row”, and “column”) is enhanced, thus enabling better latency hiding.

[0089] Latency is the term used to describe the time delay that occurs when retrieving elements from memory, e.g., matrix and column. The reformulation of the code, collapsing the two loops into a single loop with predication, enables the present invention to prefetch elements from memory. These elements must be prefetched a specific amount of time ahead of their use in order to hide the memory latency. That is, the prefetch must lead the use by the amount of time it takes to fetch elements from memory (latency of memory) plus the time it takes to fetch elements from the cache (latency of cache).

[0090] FIG. 8 shows the structure of array allocation. Both column 810 and matrix 820 are stored in memory 850. The vector is stored in the second level of cache, L2 870, while row and end_of_row are stored in the first level of cache, L1 880. The microprocessor has to go around, as shown by the arrows 830 and 840, the cache 860 in order to retrieve elements from matrix 820 and column 810. Using the embodiment shown in FIG. 3 and the sample code on page 18, in order to compute tmp_product=matrix(ii)*vector(column(ii)), the elements of column(ii) must be prefetched in the amount of time it takes to fetch elements of column(ii) from memory (latency of memory) plus the time it takes to fetch vector(column(ii)) from cache (latency of L2). Elements from matrix(ii) must be prefetched only in the amount of time it takes to fetch the matrix elements from memory (latency of memory).
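
The lead times described above can be turned into a prefetch distance, expressed in loop iterations, by dividing the lead time by the time one iteration takes and rounding up. The short Python sketch below works through that arithmetic. All of the numbers in it (a memory latency of 200 cycles, an L2 latency of 12 cycles, and 4 cycles per iteration) are hypothetical values chosen only for illustration, as is the function name “prefetch_distance”; none of them comes from the embodiments described herein.

```python
import math


def prefetch_distance(lead_time_cycles, cycles_per_iteration):
    """Number of iterations ahead at which a prefetch must be issued."""
    return math.ceil(lead_time_cycles / cycles_per_iteration)


# Hypothetical machine parameters, for illustration only.
MEMORY_LATENCY = 200        # cycles to fetch an element from memory
L2_LATENCY = 12             # cycles to fetch an element from the L2 cache
CYCLES_PER_ITERATION = 4    # cycles spent in one pass of the flattened loop

# column(ii) is needed early enough that vector(column(ii)) can then be
# fetched from L2, so its lead time is memory latency plus L2 latency.
column_lead = MEMORY_LATENCY + L2_LATENCY
# matrix(ii) is used directly, so its lead time is memory latency alone.
matrix_lead = MEMORY_LATENCY

column_distance = prefetch_distance(column_lead, CYCLES_PER_ITERATION)
matrix_distance = prefetch_distance(matrix_lead, CYCLES_PER_ITERATION)
```

Under these assumed values the “column” elements would be prefetched 53 iterations ahead and the “matrix” elements 50 iterations ahead.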

[0091] With the penalty associated with branch mispredictions and latency problems removed, performance is largely limited by bandwidth to the data store. Typically, microprocessors can perform the computations in much less time than the time it takes to fetch the elements necessary for the computation. Consequently, the faster the microprocessor fetches the elements, the more computation it can perform. Microprocessors are typically designed with a limited amount of memory bandwidth. Thus, the bandwidth is a scarce resource.

[0092] The accesses to the “vector” array possess temporal locality but not spatial locality, and the accesses to the “matrix”, “end_of_row”, and “column” arrays possess spatial locality, but not temporal locality. This property can advantageously be used in managing cache allocations, such that the “vector” array is stored in the cache hierarchy, and the “matrix”, “end_of_row”, and “column” arrays bypass the caches. This provides increased performance by eliminating wasted bandwidth caused by accessing the “vector” array via the access path used to access the “matrix”, “end_of_row”, and “column” arrays.

[0093] The present invention employs allocation control mechanisms to separate the temporal-nonspatial elements from the nontemporal-spatial elements. Based on these mechanisms, the temporal-nonspatial elements are stored in cache while the nontemporal-spatial elements are not. Vector elements possess temporal-nonspatial characteristics while matrix and column elements possess nontemporal-spatial characteristics. The temporal-nonspatial elements are stored in cache because they will be used again by the microprocessor while the nontemporal-spatial elements are used only once.

[0094] Since the nontemporal-spatial elements are used only once, the present invention strides through these column and matrix elements, that is, it accesses them sequentially with a stride of one. In doing so, the bandwidth is used most efficiently. By using these allocation control mechanisms, the present invention utilizes the microprocessor's scarce and valuable resource, its memory bandwidth, more efficiently and, at the same time, maintains the balance of the machine.

[0095] Without allocation control mechanisms, the microprocessor would store the nontemporal-spatial elements in cache, which would displace the temporal-nonspatial elements already stored in cache. When the microprocessor needs a temporal-nonspatial element, i.e., a vector element, that was displaced by a nontemporal-spatial element, it would have to gather that element again. Thus, one advantage of the use of allocation control mechanisms is that it reduces the bandwidth gather requirement.
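
The displacement effect can be illustrated with a toy simulation. The Python sketch below models a small cache with least-recently-used replacement and counts how often a “vector” element must be gathered again, with and without the streamed “matrix” and “column” elements being allowed into the cache. It is a deliberately simplified model and not a description of any particular microprocessor: each element is treated as its own cache line, the replacement policy and the capacity of four entries are assumptions, and the function name “vector_misses” is hypothetical.

```python
from collections import OrderedDict


def vector_misses(column, capacity, bypass_streams):
    """Count misses on vector elements in a toy LRU cache.

    column         : column position of each non-zero element, in access order
    capacity       : number of entries the cache can hold
    bypass_streams : if True, matrix and column elements are not cached
    """
    cache = OrderedDict()
    misses = 0

    def touch(key):
        """Access one element; return True on a hit, False on a miss."""
        hit = key in cache
        if hit:
            cache.move_to_end(key)            # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)     # evict least recently used
        return hit

    for i, col in enumerate(column):
        if not bypass_streams:
            touch(("matrix", i))              # streamed, used only once
            touch(("column", i))              # streamed, used only once
        if not touch(("vector", col)):
            misses += 1                       # vector element must be gathered
    return misses


column = [0, 2, 1, 0, 2]    # column positions for a 3 x 3 example matrix
misses_without_control = vector_misses(column, capacity=4, bypass_streams=False)
misses_with_control = vector_misses(column, capacity=4, bypass_streams=True)
```

In this example every one of the five vector accesses misses when the streamed elements are cached, while only the three first-time accesses miss when the streamed elements bypass the cache.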

[0096] The particular embodiments disclosed above are illustrative only, as the invention may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. Furthermore, no limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope and spirit of the invention. Accordingly, the protection sought herein is as set forth in the claims below.

What is claimed:
 1. A computer readable medium for storing instructions which, when executed by a computer, cause the computer to multiply a sparse matrix by a vector and produce a resulting vector, by performing the steps of: creating a first array of elements containing the non-zero elements of the sparse matrix; creating a second array of elements containing the row position of the last non-zero element in each row of the sparse matrix; initializing a variable; and executing a set of instructions for each element of the first array, either equating the variable to the sum of the variable and the product of the element of the first array and a particular element of the vector, or equating a particular element of the resulting vector to the variable, and then equating the variable to a particular value.
 2. The computer readable medium of claim 1 wherein the set of instructions is predicated.
 3. The computer readable medium of claim 1 wherein the particular value is the product of the element of the first array and the particular element of the vector.
 4. The computer readable medium of claim 1 further comprising the step of prefetching the elements of the first array from memory.
 5. The computer readable medium of claim 1 further comprising an allocation control mechanism wherein the allocation control mechanism separates the elements with temporal locality from the elements with spatial locality.
 6. The computer readable medium of claim 1 further comprising the step of storing the elements with temporal locality in a cache memory.
 7. The computer readable medium of claim 1 further comprising the step of striding through the elements with spatial locality.
 8. The computer readable medium of claim 1 wherein the elements of the first array maintain spatial locality.
 9. The computer readable medium of claim 1 wherein the elements of the vector maintain temporal locality.
 10. A method for causing a computer to multiply a sparse matrix by a vector and produce a resulting vector, comprising the steps of: creating a first array of elements containing the non-zero elements of the sparse matrix; creating a second array of elements containing the row position of the last non-zero element in each row of the sparse matrix; initializing a variable; and executing a set of instructions for each element of the first array, either equating the variable to the sum of the variable and the product of the element of the first array and a particular element of the vector, or equating a particular element of the resulting vector to the variable, and then equating the variable to a particular value.
 11. The method of claim 10 wherein the set of instructions is predicated.
 12. The method of claim 10 wherein the particular value is the product of the element of the first array and the particular element of the vector.
 13. The method of claim 10 further comprising the step of prefetching the elements of the first array from memory.
 14. The method of claim 10 further comprising an allocation control mechanism wherein the allocation control mechanism separates the elements with temporal locality from the elements with spatial locality.
 15. The method of claim 10 further comprising the step of storing the elements with temporal locality in a cache memory.
 16. The method of claim 10 further comprising the step of striding through the elements with spatial locality.
 17. The method of claim 10 wherein the elements of the first array maintain spatial locality.
 18. The method of claim 10 wherein the elements of the vector maintain temporal locality.
 19. A computer system, comprising a microprocessor and a medium containing instructions, wherein the instructions, when executed by a computer, cause the computer to multiply a sparse matrix by a vector and produce a resulting vector, by performing the steps of: creating a first array of elements containing the non-zero elements of the sparse matrix; creating a second array of elements containing the row position of the last non-zero element in each row of the sparse matrix; initializing a variable; and executing a set of instructions for each element of the first array, either equating the variable to the sum of the variable and the product of the element of the first array and a particular element of the vector, or equating a particular element of the resulting vector to the variable, and then equating the variable to a particular value.
 20. The computer system of claim 19 wherein the set of instructions is predicated.
 21. The computer system of claim 19 wherein the particular value is the product of the element of the first array and the particular element of the vector.
 22. The computer system of claim 19 further comprising the step of prefetching the elements of the first array from memory.
 23. The computer system of claim 19 further comprising an allocation control mechanism wherein the allocation control mechanism separates the elements with temporal locality from the elements with spatial locality.
 24. The computer system of claim 19 further comprising the step of storing the elements with temporal locality in a cache memory.
 25. The computer system of claim 19 further comprising the step of striding through the elements with spatial locality.
 26. The computer system of claim 19 wherein the elements of the first array maintain spatial locality.
 27. The computer system of claim 19 wherein the elements of the vector maintain temporal locality.
 28. A computer readable medium for storing instructions which, when executed by a computer, cause the computer to multiply a matrix having rows and columns containing non-zero elements and zero elements by a vector and produce a resulting vector, by performing the steps of: creating a first array of elements containing the non-zero elements of the matrix; creating a second array of elements containing the row position of the last non-zero element in each row of the matrix; creating a third array of elements containing the column position of each non-zero element of the matrix; initializing a first variable; initializing a second variable; executing a set of instructions for an index incremented from 1 to the last element of the second array, in increments of 1, equating a third variable to the product of the element of the first array corresponding to the index, and the element of the vector corresponding to the element of the third array corresponding to the index; if the index is less than or equal to the element of the second array corresponding to the first variable, equating the second variable to the sum of the second variable and the third variable; if the index is greater than the element of the second array corresponding to the first variable, equating the element of the resulting vector corresponding to the first variable to the second variable, incrementing the first variable by 1, and equating the second variable to a particular value.
 29. The computer readable medium of claim 28 wherein the set of instructions is predicated.
 30. The computer readable medium of claim 28 further comprising the step of prefetching the elements of the first array and the elements of the third array from memory.
 31. The computer readable medium of claim 28 further comprising an allocation control mechanism wherein the allocation control mechanism separates the elements with temporal locality from the elements with spatial locality.
 32. The computer readable medium of claim 28 further comprising the step of storing the elements with temporal locality in a cache memory.
 33. The computer readable medium of claim 28 further comprising the step of striding through the elements with spatial locality.
 34. The computer readable medium of claim 28 wherein the elements of the first array and the elements of the third array maintain spatial locality.
 35. The computer readable medium of claim 28 wherein the elements of the vector maintain temporal locality.
 36. The computer readable medium of claim 28 wherein the particular value is the product of the element of the first array corresponding to the index, and the element of the vector corresponding to the element of the third array corresponding to the index.
 37. The computer readable medium of claim 28 further including instructions which, when executed by the computer, cause the computer to perform the steps of: storing at least a portion of the vector in a first memory; and storing the first array and/or the second array and/or the third array in a second memory.
 38. The computer readable medium of claim 37 wherein the step of storing in a first memory includes storing in a cache memory.
 39. The computer readable medium of claim 37 further including instructions which, when executed by the computer, cause the computer to perform the steps of: accessing the vector stored in the first memory via a first access path; and accessing the first array and/or the second array and/or the third array stored in the second memory via an access path different from said first access path.
 40. A method for causing a computer to multiply a matrix having rows and columns containing non-zero elements and zero elements by a vector and produce a resulting vector, comprising the steps of: creating a first array of elements containing the non-zero elements of the matrix; creating a second array of elements containing the row position of the last non-zero element in each row of the matrix; creating a third array of elements containing the column position of each non-zero element of the matrix; initializing a first variable; initializing a second variable; executing a set of instructions for an index incremented from 1 to the last element of the second array, in increments of 1, equating a third variable to the product of the element of the first array corresponding to the index, and the element of the vector corresponding to the element of the third array corresponding to the index; if the index is less than or equal to the element of the second array corresponding to the first variable, equating the second variable to the sum of the second variable and the third variable; if the index is greater than the element of the second array corresponding to the first variable, equating the element of the resulting vector corresponding to the first variable to the second variable, incrementing the first variable by 1, and equating the second variable to a particular value.
 41. The method of claim 40 wherein the set of instructions is predicated.
 42. The method of claim 40 further comprising the step of prefetching the elements of the first array and the elements of the third array from memory.
 43. The method of claim 40 further comprising an allocation control mechanism wherein the allocation control mechanism separates the elements with temporal locality from the elements with spatial locality.
 44. The method of claim 40 further comprising the step of storing the elements with temporal locality in a cache memory.
 45. The method of claim 40 further comprising the step of striding through the elements with spatial locality.
 46. The method of claim 40 wherein the elements of the first array and the elements of the third array maintain spatial locality.
 47. The method of claim 40 wherein the elements of the vector maintain temporal locality.
 48. The method of claim 40 wherein the particular value is the product of the element of the first array corresponding to the index, and the element of the vector corresponding to the element of the third array corresponding to the index.
 49. The method of claim 40 further comprising the steps of: storing at least a portion of the vector in a first memory; and storing the first array and/or the second array and/or the third array in a second memory.
 50. The method of claim 49 wherein the step of storing in a first memory includes storing in a cache memory.
 51. The method of claim 49 further comprising the steps of: accessing the vector stored in the first memory via a first access path; and accessing the first array and/or the second array and/or the third array stored in the second memory via an access path different from said first access path.
 52. A computer system, comprising a microprocessor and a medium containing instructions, wherein the instructions, when executed by a computer, cause the computer to multiply a matrix having rows and columns containing non-zero elements and zero elements by a vector and produce a resulting vector, by performing the steps of: creating a first array of elements containing the non-zero elements of the matrix; creating a second array of elements containing the row position of the last non-zero element in each row of the matrix; creating a third array of elements containing the column position of each non-zero element of the matrix; initializing a first variable; initializing a second variable; executing a set of instructions for an index incremented from 1 to the last element of the second array, in increments of 1, equating a third variable to the product of the element of the first array corresponding to the index, and the element of the vector corresponding to the element of the third array corresponding to the index; if the index is less than or equal to the element of the second array corresponding to the first variable, equating the second variable to the sum of the second variable and the third variable; if the index is greater than the element of the second array corresponding to the first variable, equating the element of the resulting vector corresponding to the first variable to the second variable, incrementing the first variable by 1, and equating the second variable to a particular value.
 53. The computer system of claim 52 wherein the set of instructions is predicated.
 54. The computer system of claim 52 further comprising the step of prefetching the elements of the first array and the elements of the third array from memory.
 55. The computer system of claim 52 further comprising an allocation control mechanism wherein the allocation control mechanism separates the elements with temporal locality from the elements with spatial locality.
 56. The computer system of claim 52 further comprising the step of storing the elements with temporal locality in a cache memory.
 57. The computer system of claim 52 further comprising the step of striding through the elements with spatial locality.
 58. The computer system of claim 52 wherein the elements of the first array and the elements of the third array maintain spatial locality.
 59. The computer system of claim 52 wherein the elements of the vector maintain temporal locality.
 60. The computer system of claim 52 wherein the particular value is the product of the element of the first array corresponding to the index, and the element of the vector corresponding to the element of the third array corresponding to the index.
 61. The computer system of claim 52 further including instructions which, when executed by the computer, cause the computer to perform the steps of: storing at least a portion of the vector in a first memory; and storing the first array and/or the second array and/or the third array in a second memory.
 62. The computer system of claim 61 wherein the step of storing in a first memory includes storing in a cache memory.
 63. The computer system of claim 61 wherein: the vector stored in the first memory is accessed via a first access path; and the first array and/or the second array and/or the third array stored in the second memory is accessed via an access path different from said first access path.
 64. A computer readable medium for storing instructions which, when executed by a computer, cause the computer to multiply a matrix having m rows and n columns containing non-zero elements and zero elements by an initial array having n rows and p columns and produce a resulting array having m rows and p columns, by performing the steps of: creating a first array of elements containing the non-zero elements of the matrix; creating a second array of elements containing the row position of the last non-zero element in each row of the matrix; creating a third array of elements containing the column position of each non-zero element of the matrix; executing a set of instructions for each column of the initial array and the resulting array, incremented from 1 to p in increments of 1, initializing a first variable; initializing a second variable; for an index incremented from 1 to the last element of the second array, in increments of 1, equating a third variable to the product of the element of the first array corresponding to the index, and the element of the initial array corresponding to the element of the third array corresponding to the index; if the index is less than or equal to the element of the second array corresponding to the first variable, equating the second variable to the sum of the second variable and the third variable; if the index is greater than the element of the second array corresponding to the first variable, equating the element of the resulting array corresponding to the first variable to the second variable, incrementing the first variable by 1, and equating the second variable to a particular value.
 65. The computer readable medium of claim 64 wherein the set of instructions is predicated.
 66. The computer readable medium of claim 64 further comprising the step of prefetching the elements of the first array and the elements of the third array from memory.
 67. The computer readable medium of claim 64 further comprising an allocation control mechanism wherein the allocation control mechanism separates the elements with temporal locality from the elements with spatial locality.
 68. The computer readable medium of claim 64 further comprising the step of storing the elements with temporal locality in a cache memory.
 69. The computer readable medium of claim 64 further comprising the step of striding through the elements with spatial locality.
 70. The computer readable medium of claim 64 wherein the elements of the first array and the elements of the third array maintain spatial locality.
 71. The computer readable medium of claim 64 wherein the elements of the initial array maintain temporal locality.
 72. The computer readable medium of claim 64 wherein p is greater than 1.
 73. A method for causing a computer to multiply a matrix having m rows and n columns containing non-zero elements and zero elements by an initial array having n rows and p columns and produce a resulting array having m rows and p columns, comprising the steps of: creating a first array of elements containing the non-zero elements of the matrix; creating a second array of elements containing the row position of the last non-zero element in each row of the matrix; creating a third array of elements containing the column position of each non-zero element of the matrix; executing a set of instructions for each column of the initial array and the resulting array, incremented from 1 to p in increments of 1, initializing a first variable; initializing a second variable; for an index incremented from 1 to the last element of the second array, in increments of 1, equating a third variable to the product of the element of the first array corresponding to the index, and the element of the initial array corresponding to the element of the third array corresponding to the index; if the index is less than or equal to the element of the second array corresponding to the first variable, equating the second variable to the sum of the second variable and the third variable; if the index is greater than the element of the second array corresponding to the first variable, equating the element of the resulting array corresponding to the first variable to the second variable, incrementing the first variable by 1, and equating the second variable to a particular value.
 74. The method of claim 73 wherein the set of instructions is predicated.
 75. The method of claim 73 further comprising the step of prefetching the elements of the first array and the elements of the third array from memory.
 76. The method of claim 73 further comprising an allocation control mechanism wherein the allocation control mechanism separates the elements with temporal locality from the elements with spatial locality.
 77. The method of claim 73 further comprising the step of storing the elements with temporal locality in a cache memory.
 78. The method of claim 73 further comprising the step of striding through the elements with spatial locality.
 79. The method of claim 73 wherein the elements of the first array and the elements of the third array maintain spatial locality.
 80. The method of claim 73 wherein the elements of the initial array maintain temporal locality.
 81. The method of claim 73 wherein p is greater than 1.
 82. A computer system, comprising a microprocessor and a medium containing instructions, wherein the instructions, when executed by a computer, cause the computer to multiply a matrix having m rows and n columns containing non-zero elements and zero elements by an initial array having n rows and p columns and produce a resulting array having m rows and p columns, by performing the steps of: creating a first array of elements containing the non-zero elements of the matrix; creating a second array of elements containing the row position of the last non-zero element in each row of the matrix; creating a third array of elements containing the column position of each non-zero element of the matrix; executing a set of instructions for each column of the initial array and the resulting array, incremented from 1 to p in increments of 1, initializing a first variable; initializing a second variable; for an index incremented from 1 to the last element of the second array, in increments of 1, equating a third variable to the product of the element of the first array corresponding to the index, and the element of the initial array corresponding to the element of the third array corresponding to the index; if the index is less than or equal to the element of the second array corresponding to the first variable, equating the second variable to the sum of the second variable and the third variable; if the index is greater than the element of the second array corresponding to the first variable, equating the element of the resulting array corresponding to the first variable to the second variable, incrementing the first variable by 1, and equating the second variable to a particular value.
 83. The computer system of claim 82 wherein the set of instructions is predicated.
 84. The computer system of claim 82 further comprising the step of prefetching the elements of the first array and the elements of the third array from memory.
 85. The computer system of claim 82 further comprising an allocation control mechanism wherein the allocation control mechanism separates the elements with temporal locality from the elements with spatial locality.
 86. The computer system of claim 82 further comprising the step of storing the elements with temporal locality in a cache memory.
 87. The computer system of claim 82 further comprising the step of striding through the elements with spatial locality.
 88. The computer system of claim 82 wherein the elements of the first array and the elements of the third array maintain spatial locality.
 89. The computer system of claim 82 wherein the elements of the initial array maintain temporal locality.
 90. The computer system of claim 82 wherein p is greater than 1.
 91. A computer readable medium for storing instructions which, when executed by a computer, cause the computer to multiply a matrix having rows and columns containing non-zero elements and zero elements by a vector and produce a resulting vector, by performing the steps of: creating a first array of elements containing the non-zero elements of the matrix; creating a second array of elements containing the row position of the last non-zero element in each row of the matrix; creating a third array of elements containing the column position of each non-zero element of the matrix; initializing a first variable; initializing a second variable; executing a set of instructions for an index incremented from 1 to the last element of the second array, in increments of 1, if the index is less than or equal to the element of the second array corresponding to the first variable, equating the second variable to the sum of the second variable and the product of the element of the first array corresponding to the index, and the element of the vector corresponding to the element of the third array corresponding to the index; if the index is greater than the element of the second array corresponding to the first variable, equating the element of the resulting vector corresponding to the first variable to the second variable, incrementing the first variable by 1, and equating the second variable to a particular value.
 92. The computer readable medium of claim 91 wherein the set of instructions is predicated.
 93. The computer readable medium of claim 91 further comprising the step of prefetching the elements of the first array and the elements of the third array from memory.
 94. The computer readable medium of claim 91 further comprising an allocation control mechanism wherein the allocation control mechanism separates the elements with temporal locality from the elements with spatial locality.
 95. The computer readable medium of claim 91 further comprising the step of storing the elements with temporal locality in a cache memory.
 96. The computer readable medium of claim 91 further comprising the step of striding through the elements with spatial locality.
 97. The computer readable medium of claim 91 wherein the elements of the first array and the elements of the third array maintain spatial locality.
 98. The computer readable medium of claim 91 wherein the elements of the vector maintain temporal locality.
 99. A method for causing a computer to multiply amatrix having rows and columns containing non-zero elements and zeroelements by a vector and produce a resulting vector, comprising thesteps of: creating a first array of elements containing the non-zeroelements of the matrix; creating a second array of elements containingthe row position of the last non-zero element in each row of the matrix;creating a third array of elements containing the column position ofeach non-zero element of the matrix; initializing a first variable;initializing a second variable; executing a set of instructions for anindex incremented from 1 to the last element of the second array, inincrements of 1, if the index is less than or equal to the element ofthe second array corresponding to the first variable, equating thesecond variable to the sum of the second variable and the product of theelement of the first array corresponding to the index, and the elementof the vector corresponding to the element of the third arraycorresponding to the index; if the index is greater than the element ofthe second array corresponding to the first variable, equating theelement of the resulting vector corresponding to the first variable tothe second variable, incrementing the first variable by 1, and equatingthe second variable to a particular value.
 100. The method of claim 99 wherein the set of instructions is predicated.
 101. The method of claim 99 further comprising the step of prefetching the elements of the first array and the elements of the third array from memory.
 102. The method of claim 99 further comprising an allocation control mechanism wherein the allocation control mechanism separates the elements with temporal locality from the elements with spatial locality.
 103. The method of claim 99 further comprising the step of storing the elements with temporal locality in a cache memory.
 104. The method of claim 99 further comprising the step of striding through the elements with spatial locality.
 105. The method of claim 99 wherein the elements of the first array and the elements of the third array maintain spatial locality.
 106. The method of claim 99 wherein the elements of the vector maintain temporal locality.
 107. A computer system, comprising a microprocessor and a medium containing instructions, wherein the instructions, when executed by a computer, cause the computer to multiply a matrix having rows and columns containing non-zero elements and zero elements by a vector and produce a resulting vector, by performing the steps of: creating a first array of elements containing the non-zero elements of the matrix; creating a second array of elements containing the row position of the last non-zero element in each row of the matrix; creating a third array of elements containing the column position of each non-zero element of the matrix; initializing a first variable; initializing a second variable; executing a set of instructions for an index incremented from 1 to the last element of the second array, in increments of 1, if the index is less than or equal to the element of the second array corresponding to the first variable, equating the second variable to the sum of the second variable and the product of the element of the first array corresponding to the index, and the element of the vector corresponding to the element of the third array corresponding to the index; if the index is greater than the element of the second array corresponding to the first variable, equating the element of the resulting vector corresponding to the first variable to the second variable, incrementing the first variable by 1, and equating the second variable to a particular value.
 108. The computer system of claim 107 wherein the set of instructions is predicated.
 109. The computer system of claim 107 further comprising the step of prefetching the elements of the first array and the elements of the third array from memory.
 110. The computer system of claim 107 further comprising an allocation control mechanism wherein the allocation control mechanism separates the elements with temporal locality from the elements with spatial locality.
 111. The computer system of claim 107 further comprising the step of storing the elements with temporal locality in a cache memory.
 112. The computer system of claim 107 further comprising the step of striding through the elements with spatial locality.
 113. The computer system of claim 107 wherein the elements of the first array and the elements of the third array maintain spatial locality.
 114. The computer system of claim 107 wherein the elements of the vector maintain temporal locality.
 115. A computer readable medium for storing instructions which, when executed by a computer, cause the computer to multiply a matrix having rows and columns containing non-zero elements and zero elements by a vector and produce a resulting vector, by performing the steps of: creating a first array of elements containing the non-zero elements of the matrix; creating a second array of elements containing the row position of the last non-zero element in each row of the matrix; creating a third array of elements containing the column position of each non-zero element of the matrix; initializing a first variable; initializing a second variable; initializing a third variable; executing a set of instructions for an index incremented from 1 to the last element of the second array, in increments of 2, if the index is less than or equal to the element of the second array corresponding to the first variable, equating the second variable to the sum of the third variable and the product of the element of the first array corresponding to the index, and the element of the vector corresponding to the element of the third array corresponding to the index; if the index is greater than the element of the second array corresponding to the first variable, equating the element of the resulting vector corresponding to the first variable to the third variable, incrementing the first variable by 1, and equating the second variable to the product of the element of the first array corresponding to the index, and the element of the vector corresponding to the element of the third array corresponding to the index; if the index+1 is less than or equal to the element of the second array corresponding to the first variable, equating the third variable to the sum of the second variable and the product of the element of the first array corresponding to the index+1, and the element of the vector corresponding to the element of the third array corresponding to the index+1; if the index+1 is greater than the element of the second array corresponding to the first variable, equating the element of the resulting vector corresponding to the first variable to the second variable, incrementing the first variable by 1, and equating the third variable to the product of the element of the first array corresponding to the index+1, and the element of the vector corresponding to the element of the third array corresponding to the index+1; and then equating the element of the resulting vector corresponding to the first variable to the third variable when the last element of the second array is even.
 116. The computer readable medium of claim 115 wherein the set of instructions is predicated.
 117. The computer readable medium of claim 115 further comprising the step of prefetching the elements of the first array and the elements of the third array from memory.
 118. The computer readable medium of claim 115 further comprising an allocation control mechanism wherein the allocation control mechanism separates the elements with temporal locality from the elements with spatial locality.
 119. The computer readable medium of claim 115 further comprising the step of storing the elements with temporal locality in a cache memory.
 120. The computer readable medium of claim 115 further comprising the step of striding through the elements with spatial locality.
 121. The computer readable medium of claim 115 wherein the elements of the first array and the elements of the third array maintain spatial locality.
 122. The computer readable medium of claim 115 wherein the elements of the vector maintain temporal locality.
 123. A method for causing a computer to multiply a matrix having rows and columns containing non-zero elements and zero elements by a vector and produce a resulting vector, comprising the steps of: creating a first array of elements containing the non-zero elements of the matrix; creating a second array of elements containing the row position of the last non-zero element in each row of the matrix; creating a third array of elements containing the column position of each non-zero element of the matrix; initializing a first variable; initializing a second variable; initializing a third variable; executing a set of instructions for an index incremented from 1 to the last element of the second array, in increments of 2, if the index is less than or equal to the element of the second array corresponding to the first variable, equating the second variable to the sum of the third variable and the product of the element of the first array corresponding to the index, and the element of the vector corresponding to the element of the third array corresponding to the index; if the index is greater than the element of the second array corresponding to the first variable, equating the element of the resulting vector corresponding to the first variable to the third variable, incrementing the first variable by 1, and equating the second variable to the product of the element of the first array corresponding to the index, and the element of the vector corresponding to the element of the third array corresponding to the index; if the index+1 is less than or equal to the element of the second array corresponding to the first variable, equating the third variable to the sum of the second variable and the product of the element of the first array corresponding to the index+1, and the element of the vector corresponding to the element of the third array corresponding to the index+1; if the index+1 is greater than the element of the second array corresponding to the first variable, equating the element of the resulting vector corresponding to the first variable to the second variable, incrementing the first variable by 1, and equating the third variable to the product of the element of the first array corresponding to the index+1, and the element of the vector corresponding to the element of the third array corresponding to the index+1; and then equating the element of the resulting vector corresponding to the first variable to the third variable only if the last element of the second array is even.
 124. The method of claim 123 wherein the set of instructions is predicated.
 125. The method of claim 123 further comprising the step of prefetching the elements of the first array and the elements of the third array from memory.
 126. The method of claim 123 further comprising an allocation control mechanism wherein the allocation control mechanism separates the elements with temporal locality from the elements with spatial locality.
 127. The method of claim 123 further comprising the step of storing the elements with temporal locality in a cache memory.
 128. The method of claim 123 further comprising the step of striding through the elements with spatial locality.
 129. The method of claim 123 wherein the elements of the first array and the elements of the third array maintain spatial locality.
 130. The method of claim 123 wherein the elements of the vector maintain temporal locality.
 131. A computer system, comprising a microprocessor and a medium containing instructions, wherein the instructions, when executed by a computer, cause the computer to multiply a matrix having rows and columns containing non-zero elements and zero elements by a vector and produce a resulting vector, by performing the steps of: creating a first array of elements containing the non-zero elements of the matrix; creating a second array of elements containing the row position of the last non-zero element in each row of the matrix; creating a third array of elements containing the column position of each non-zero element of the matrix; initializing a first variable; initializing a second variable; initializing a third variable; executing a set of instructions for an index incremented from 1 to the last element of the second array, in increments of 2, if the index is less than or equal to the element of the second array corresponding to the first variable, equating the second variable to the sum of the third variable and the product of the element of the first array corresponding to the index, and the element of the vector corresponding to the element of the third array corresponding to the index; if the index is greater than the element of the second array corresponding to the first variable, equating the element of the resulting vector corresponding to the first variable to the third variable, incrementing the first variable by 1, and equating the second variable to the product of the element of the first array corresponding to the index, and the element of the vector corresponding to the element of the third array corresponding to the index; if the index+1 is less than or equal to the element of the second array corresponding to the first variable, equating the third variable to the sum of the second variable and the product of the element of the first array corresponding to the index+1, and the element of the vector corresponding to the element of the third array corresponding to the index+1; if the index+1 is greater than the element of the second array corresponding to the first variable, equating the element of the resulting vector corresponding to the first variable to the second variable, incrementing the first variable by 1, and equating the third variable to the product of the element of the first array corresponding to the index+1, and the element of the vector corresponding to the element of the third array corresponding to the index+1; and then equating the element of the resulting vector corresponding to the first variable to the third variable when the last element of the second array is even.
 132. The computer system of claim 131 wherein the set of instructions is predicated.
 133. The computer system of claim 131 further comprising the step of prefetching the elements of the first array and the elements of the third array from memory.
 134. The computer system of claim 131 further comprising an allocation control mechanism wherein the allocation control mechanism separates the elements with temporal locality from the elements with spatial locality.
 135. The computer system of claim 131 further comprising the step of storing the elements with temporal locality in a cache memory.
 136. The computer system of claim 131 further comprising the step of striding through the elements with spatial locality.
 137. The computer system of claim 131 wherein the elements of the first array and the elements of the third array maintain spatial locality.
 138. The computer system of claim 131 wherein the elements of the vector maintain temporal locality.
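
For illustration only, and not as part of any claim, the loops recited in claims 99 and 115 might be sketched in Python as follows. The function and variable names are hypothetical. The sketch assumes that the "particular value" of claim 99 is the product at the current index, that the last row is stored once the loop ends, that every row contains at least one non-zero element, that the second array holds 1-based positions while the third array holds 0-based column numbers, and that the arrays are padded by one zero entry so that index+1 can be read when the number of non-zero elements is odd.

```python
def spmv_single(values, row_end, col, x):
    """One-accumulator loop in the spirit of claim 99 (illustrative sketch)."""
    y = [0.0] * len(row_end)
    row = 0        # "first variable": current row
    acc = 0.0      # "second variable": running sum for the current row
    nnz = row_end[-1]
    for i in range(1, nnz + 1):            # index runs 1..nnz as in the claim
        term = values[i - 1] * x[col[i - 1]]
        if i <= row_end[row]:
            acc += term
        else:
            y[row] = acc                   # store the finished row
            row += 1
            acc = term                     # assumed "particular value"
    y[row] = acc                           # assumed final store for the last row
    return y


def spmv_unrolled2(values, row_end, col, x):
    """Two-accumulator loop, step 2, in the spirit of claim 115 (sketch)."""
    y = [0.0] * len(row_end)
    nnz = row_end[-1]
    vals = list(values) + [0.0]            # assumed padding for index+1
    cols = list(col) + [0]
    row = 0                                # "first variable"
    s2 = 0.0                               # "second variable"
    s3 = 0.0                               # "third variable"
    for i in range(1, nnz + 1, 2):
        term = vals[i - 1] * x[cols[i - 1]]
        if i <= row_end[row]:
            s2 = s3 + term
        else:
            y[row] = s3
            row += 1
            s2 = term
        term = vals[i] * x[cols[i]]        # element at index+1
        if i + 1 <= row_end[row]:
            s3 = s2 + term
        else:
            y[row] = s2
            row += 1
            s3 = term
    if nnz % 2 == 0:                       # final store only when nnz is even
        y[row] = s3
    return y
```

As a usage example, the matrix with rows (1, 0, 2), (0, 3, 0) and (4, 5, 6) would be encoded under these assumptions as `values = [1, 2, 3, 4, 5, 6]`, `row_end = [2, 3, 6]` and `col = [0, 2, 1, 0, 1, 2]`. Both functions can then be called with a vector such as `x = [1, 2, 3]`.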